# Multimodal image-text understanding

Qwen2.5 VL 3B Instruct GGUF

Qwen2.5-VL-3B-Instruct is a 3B-parameter multimodal model that takes images and text as input and generates text. This GGUF build is optimized for the vision capabilities in llama.cpp.

Tags: Text-to-Image, English

Gme Qwen2 VL 2B Instruct GGUF

A quantized version of a multimodal model that supports both English and Chinese and is suited to image-text-to-text tasks.

Tags: Image-to-Text, Supports Multiple Languages

© 2025 AIbase